fix: handle DrainIngress in fake_data_generator to unblock graceful shutdown #2515
jmacd merged 14 commits into open-telemetry:main
Conversation
Codecov Report ❌ — additional details and impacted files:
@@ Coverage Diff @@
## main #2515 +/- ##
==========================================
+ Coverage 88.37% 88.39% +0.02%
==========================================
Files 620 622 +2
Lines 228395 230170 +1775
==========================================
+ Hits 201836 203457 +1621
- Misses 26035 26189 +154
Partials 524 524
The sequencing here looks off. In graceful shutdown the runtime does:
This change makes fake_data_generator do
but that Shutdown is not part of the normal post-drain receiver path. For this receiver, once ingress is stopped there is no receiver-local work left to preserve, so it should exit directly on DrainIngress rather than reporting drained and then blocking while it waits for another shutdown message.
The correct fix should be something like (not tested):

Ok(NodeControlMsg::DrainIngress { deadline, .. }) => {
    otel_info!("fake_data_generator.drain_ingress");
    effect_handler.notify_receiver_drained().await?;
    return Ok(TerminalState::new(deadline, [self.metrics.snapshot()]));
}
The fix now looks correct. However, the CI failures suggest there is still a shutdown race. One option could be to address this in two places:
if signals_per_second.is_some() {
    let remaining_time = wait_till - Instant::now();
    if remaining_time.as_secs_f64() > 0.0 {
        tokio::select! {
            biased;
            ctrl_msg = ctrl_msg_recv.recv() => {
                // handle DrainIngress / Shutdown during the rate-limit wait
                // using the same control-message handling as the main loop
            }
            _ = sleep(remaining_time) => {}
        }
    }
}
Ok(NodeControlMsg::DrainIngress { deadline, .. }) => {
    otel_info!("fake_data_generator.drain_ingress");
    let _ = effect_handler.notify_receiver_drained().await;
    return Ok(TerminalState::new(deadline, [self.metrics.snapshot()]));
}
main.rs but the Dockerfile was not updated to copy the file from the otel-arrow build context, breaking the Docker build. Co-authored-by: Copilot <[email protected]>
Verifies the receiver handles DrainIngress promptly even while sleeping in a rate-limit interval. Without the DrainIngress handler the receiver would stall until the drain deadline expired, causing DrainDeadlineReached. Co-authored-by: Copilot <[email protected]>
Review comment on:
    Ok(NodeControlMsg::DrainIngress { deadline, .. }) => {
        otel_info!("fake_data_generator.drain_ingress");
        let _ = effect_handler.notify_receiver_drained().await;

Suggested change:
-        let _ = effect_handler.notify_receiver_drained().await;
+        effect_handler.notify_receiver_drained().await?;
Review comment on:
    _ = sleep(remaining_time) => {}
The current sleep makes DrainIngress/Shutdown responsive, but it also changes the rate-limiting behavior: any non-terminal control message handled as Ok(None) exits the sleep immediately, so the next batch can be sent before the original wait_till. We should replace lines 445-456 above with:
// Keep the original sleep deadline even if non-terminal control
// messages arrive. Only DrainIngress/Shutdown should interrupt
// the rate-limit wait early.
let sleep_until = sleep(remaining_time);
tokio::pin!(sleep_until);
loop {
    tokio::select! {
        biased;
        ctrl_msg = ctrl_msg_recv.recv() => {
            if let Some(terminal) =
                handle_control_msg(ctrl_msg, &effect_handler, &mut self.metrics).await?
            {
                return Ok(terminal);
            }
        }
        _ = &mut sleep_until => break,
    }
}
Please check the new changes + new test.
Change Summary: Adds a necessary Dockerfile line to fix the build. Adds a test to our CI/CD workflow, which would have caught this in #2597. This was going to block #2515. Co-authored-by: Copilot <[email protected]>
Change Summary
The "Ack nack redesign" PR (3dca283) introduced a two-phase DrainIngress/ReceiverDrained shutdown protocol but missed updating the fake_data_generator receiver. Without the DrainIngress handler, the message falls into the _ => {} catch-all, notify_receiver_drained() is never called, the pipeline controller never removes the receiver from its pending set, and after the deadline expires it emits DrainDeadlineReached. This was causing pipeline-perf-test-basic to fail consistently.
What issue does this PR close?
pipeline-perf-test-basic unit test is failing.
How are these changes tested?
fake_data_generator and runtime_control_metrics tests were executed.
Are there any user-facing changes?
No, fake_data_generator is an internal test/load-generation receiver, not a user-facing component.